Outline

  1. IPython and IPython Notebooks
  2. Numpy
  3. Pandas

Python and IPython

  • Python is a programming language; python is also the name of the program that runs scripts written in that language.
  • If you're running scripts from the command line, you can use either ipython (e.g. ipython my_script.py) or python (e.g. python my_script.py).
  • If you're using the interpreter interactively to load and explore data, try out a new package, etc., always use ipython over python. ipython has a bunch of features like tab completion, inline help, and easy access to shell commands which are just plain great (more on these in a bit; see the short session sketch below).
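
For example, here is roughly what those features look like in an interactive ipython session (the file and package names are just placeholders):

import numpy as np
np.mea<TAB>         # tab completion suggests np.mean, np.median, ...
np.mean?            # a trailing ? shows the docstring and signature for np.mean
!ls data/           # a leading ! runs a shell command from inside ipython
files = !ls data/   # you can even capture the shell output in a Python list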

IPython Notebook

  • IPython notebook is an interactive front-end to ipython which lets you combine snippets of python code with explanations, images, videos, whatever.
  • It's also really convenient for conveying experimental results.
  • http://nbviewer.ipython.org

Notebook Concepts

  • Cells -- That grey box is called a cell. An IPython notebook is nothing but a series of cells.
  • Selecting -- You can tell if you have a cell selected because it will have a thin, black box around it.
  • Running a Cell -- Running a cell displays its output. You can run a cell by pressing shift + enter while it's selected (or click the play button toward the top of the screen).
  • Modes -- A selected cell is in one of two modes:
    • Command Mode -- Lets you delete a cell and change its type (more on this in a second).
    • Edit Mode -- Lets you change the contents of a cell.

Aside: Keyboard Shortcuts That I Use A Lot

  • (When describing keyboard shortcuts, + means 'press at the same time' and , means 'press after'.)
  • Enter -- Start editing the selected cell (switch from command mode to edit mode)
  • Esc -- Stop editing this cell
  • Option + Enter -- Run this cell and make a new cell after it (Note: this is OSX specific. Check help >> keyboard shortcuts to find your operating system's version)
  • Shift + Enter -- Run this cell and select the next one, without inserting a new cell
  • Up Arrow and Down Arrow -- Navigate between cells (must be in command mode)
  • Esc, m, Enter -- Convert the current cell to markdown and start editing it again
  • Esc, y, Enter -- Convert the current cell to a code cell and start editing it again
  • Esc, d, d -- Delete the current cell
  • Esc, a -- Create a new cell above the current one
  • Esc, b -- Create a new cell below the current one
  • Command + / -- Toggle comments in Python code (OSX)
  • Ctrl + / -- Toggle comments in Python code (Linux / Windows)

More keyboard shortcuts are listed [here](http://johnlaudun.org/20131228-ipython-notebook-keyboard-shortcuts/).

Numpy

Numpy is the main package you'll use for scientific computing in Python. It provides a multidimensional array datatype called ndarray, which supports things like vector and matrix computations.


In [1]:
# you don't have to rename numpy to np but it's customary to do so
import numpy as np

# you can create a 1-d array with a list of numbers
a = np.array([1, 4, 6])
print 'a:'
print a
print 'a.shape:', a.shape
print 

# you can create a 2-d array with a list of lists of numbers
b = np.array([[6, 7], [3, 1], [4, 0]])
print 'b:'
print b
print 'b.shape:', b.shape
print


a:
[1 4 6]
a.shape: (3,)

b:
[[6 7]
 [3 1]
 [4 0]]
b.shape: (3, 2)


In [2]:
# you can create an array of ones
print 'np.ones((3, 4)):'
print np.ones((3, 4))
print

# you can create an array of zeros
print 'np.zeros((2, 5)):'
print np.zeros((2, 5))
print

# you can create an array from a range of numbers and reshape it
print 'np.arange(6):'
print np.arange(6)
print 
print 'np.arange(6).reshape(2, 3):'
print np.arange(6).reshape(2, 3)
print

# you can take the transpose of a matrix with .transpose() or .T
print 'b and b.T:'
print b
print 
print b.T
print


np.ones((3, 4)):
[[ 1.  1.  1.  1.]
 [ 1.  1.  1.  1.]
 [ 1.  1.  1.  1.]]

np.zeros((2, 5)):
[[ 0.  0.  0.  0.  0.]
 [ 0.  0.  0.  0.  0.]]

np.arange(6):
[0 1 2 3 4 5]

np.arange(6).reshape(2, 3):
[[0 1 2]
 [3 4 5]]

b and b.T:
[[6 7]
 [3 1]
 [4 0]]

[[6 3 4]
 [7 1 0]]


In [3]:
# you can iterate over rows
for i, this_row in enumerate(b):
    print 'row', i, ': ', this_row
print 
    
# you can access sections of an array with slices
print 'first two rows of the first column of b:'
print b[:2, 0]
print


row 0 :  [6 7]
row 1 :  [3 1]
row 2 :  [4 0]

first two rows of the first column of b:
[6 3]
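
You can also pick out rows with a boolean mask; for example, using the same b:

# b[:, 0] > 3 is a True/False array with one entry per row of b
print 'rows of b whose first column is greater than 3:'
print b[b[:, 0] > 3]
# (this prints the first and third rows, [[6 7], [4 0]])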


In [4]:
# you can concatenate arrays in various ways:
print 'np.hstack([b, b]):'
print np.hstack([b, b])
print

print 'np.vstack([b, b]):'
print np.vstack([b, b])
print


np.hstack([b, b]):
[[6 7 6 7]
 [3 1 3 1]
 [4 0 4 0]]

np.vstack([b, b]):
[[6 7]
 [3 1]
 [4 0]
 [6 7]
 [3 1]
 [4 0]]
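
Both hstack and vstack are special cases of np.concatenate, which takes an explicit axis argument:

# axis=0 stacks rows (like vstack), axis=1 stacks columns (like hstack)
print 'np.concatenate([b, b], axis=1):'
print np.concatenate([b, b], axis=1)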


In [5]:
# note that you get an error if you pass the arrays in separately instead of as a list
print np.hstack(b, b)
print


---------------------------------------------------------------------------
TypeError                                 Traceback (most recent call last)
<ipython-input-5-93b42dec95f7> in <module>()
      1 # note that you get an error if you pass the arrays in separately instead of as a list
----> 2 print np.hstack(b, b)
      3 print

TypeError: hstack() takes exactly 1 argument (2 given)

In [6]:
# you can perform matrix multiplication with np.dot()
c = np.dot(a, b)
print 'c = np.dot(a, b):'
print c
print

# if a is already a numpy array, then you can also use this chained 
# matrix multiplication notation.  use whichever looks cleaner in 
# context
print 'a.dot(b):'
print a.dot(b)
print


# you can perform element-wise multiplication with * 
d = b * b
print 'd = b * b:'
print d
print

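# a bare expression at the end of a cell is displayed as the cell's Out[] value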
a.dot(b)


c = np.dot(a, b):
[42 11]

a.dot(b):
[42 11]

d = b * b:
[[36 49]
 [ 9  1]
 [16  0]]

Out[6]:
array([42, 11])

Arrays and Matrices

In addition to arrays, which can have any number of dimensions, Numpy also has a matrix data type which always has exactly 2. DO NOT USE matrix.

The original intention behind this data type was to make Numpy feel a bit more like Matlab, mainly by making the * operator perform matrix multiplication so you don't have to use np.dot. But matrix isn't as well supported as array: it can be slower, and passing one into other people's code will sometimes cause errors because everyone expects you to use array.
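
To see the difference concretely, here's a small comparison sketch (np.matrix appears here only to illustrate why array is the safer habit):

A = np.array([[1, 2], [3, 4]])
M = np.matrix([[1, 2], [3, 4]])

print 'array * array is element-wise:'
print A * A        # [[ 1  4]
                   #  [ 9 16]]
print
print 'matrix * matrix is a matrix product:'
print M * M        # [[ 7 10]
                   #  [15 22]]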


In [7]:
# you can convert a 1-d array to a 2-d array with np.newaxis
print 'a:'
print a
print 'a.shape:', a.shape
print 
print 'a[np.newaxis] is a 2-d row vector:'
print a[np.newaxis]
print 'a[np.newaxis].shape:', a[np.newaxis].shape
print

print 'a[np.newaxis].T is a 2-d column vector:'
print a[np.newaxis].T
print 'a[np.newaxis].T.shape:', a[np.newaxis].T.shape
print


a:
[1 4 6]
a.shape: (3,)

a[np.newaxis] is a 2-d row vector:
[[1 4 6]]
a[np.newaxis].shape: (1, 3)

a[np.newaxis].T is a 2-d column vector:
[[1]
 [4]
 [6]]
a[np.newaxis].T.shape: (3, 1)
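
An equivalent way to get the same column-vector shape is reshape, where -1 means 'infer this dimension from the others':

print 'a.reshape(-1, 1):'
print a.reshape(-1, 1)
print 'a.reshape(-1, 1).shape:', a.reshape(-1, 1).shape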


In [8]:
# numpy provides a ton of other functions for working with matrices
m = np.array([[1, 2],[3, 4]])
m_inverse = np.linalg.inv(m)
print 'inverse of [[1, 2],[3, 4]]:'
print m_inverse
print

print 'm.dot(m_inverse):'
print m.dot(m_inverse)


inverse of [[1, 2],[3, 4]]:
[[-2.   1. ]
 [ 1.5 -0.5]]

m.dot(m_inverse):
[[  1.00000000e+00   0.00000000e+00]
 [  8.88178420e-16   1.00000000e+00]]
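
If you only need the inverse in order to solve a linear system, np.linalg.solve does that directly and is generally preferable numerically. A small sketch using the same m:

v = np.array([5, 6])
x = np.linalg.solve(m, v)   # solves m.dot(x) == v without forming m_inverse
print 'solution x of m.dot(x) = v:', x
print 'check, m.dot(x):', m.dot(x)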

In [9]:
# and for doing all kinds of sciency stuff, like generating random numbers:
np.random.seed(5678)
n = np.random.randn(3, 4)
print 'a matrix with random entries drawn from a Normal(0, 1) distribution:'
print n


a matrix with random entries drawn from a Normal(0, 1) distribution:
[[-0.70978938 -0.01719118  0.31941137 -2.26533107]
 [-1.37745366  1.94998073 -0.56381007 -0.84373759]
 [ 0.22453858 -0.39137772  0.60550347 -0.68615034]]

Self-Driven Numpy Exercise

  1. In the cell below, add a column of ones to the matrix X_no_constant. This is a common task in linear regression and general linear modeling and something that you'll have to be able to do later today.
  2. Multiply your new matrix by the betas vector below to make a vector called y
  3. You'll know you've got it when the cell prints '****** Tests passed! ******' at the bottom.

Specifically, given a matrix:

\begin{equation*} \qquad \mathbf{X_{NoConstant}} = \left( \begin{array}{cccc} x_{1,1} & x_{1,2} & \dots & x_{1,D} \\ x_{2,1} & x_{2,2} & \dots & x_{2,D} \\ \vdots & \vdots & \ddots & \vdots \\ x_{i,1} & x_{i,2} & \dots & x_{i,D} \\ \vdots & \vdots & \ddots & \vdots \\ x_{N,1} & x_{N,2} & \dots & x_{N,D} \\ \end{array} \right) \qquad \end{equation*}

We want to convert it to: \begin{equation*} \qquad \mathbf{X} = \left( \begin{array}{ccccc} 1 & x_{1,1} & x_{1,2} & \dots & x_{1,D} \\ 1 & x_{2,1} & x_{2,2} & \dots & x_{2,D} \\ \vdots & \vdots & \vdots & \ddots & \vdots \\ 1 & x_{i,1} & x_{i,2} & \dots & x_{i,D} \\ \vdots & \vdots & \vdots & \ddots & \vdots \\ 1 & x_{N,1} & x_{N,2} & \dots & x_{N,D} \\ \end{array} \right) \qquad \end{equation*}

So that if we have a vector of regression coefficients like this:

\begin{equation*} \qquad \beta = \left( \begin{array}{c} \beta_0 \\ \beta_1 \\ \vdots \\ \beta_j \\ \vdots \\ \beta_D \end{array} \right) \end{equation*}

We can do this:

\begin{equation*} \mathbf{y} \equiv \mathbf{X} \mathbf{\beta} \end{equation*}

In [14]:
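# scratch work for the exercise below: a column of ones with shape (n_data, 1)
# (n_data is only defined in the next cell, so that cell has to be run first)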
a = np.ones(n_data)[np.newaxis].T
a


Out[14]:
array([[ 1.],
       [ 1.],
       [ 1.],
       [ 1.],
       [ 1.],
       [ 1.],
       [ 1.],
       [ 1.],
       [ 1.],
       [ 1.]])

In [16]:
np.random.seed(3333)
n_data = 10 # number of data points. i.e. N
n_dim = 5   # number of dimensions of each datapoint.  i.e. D

betas = np.random.randn(n_dim + 1)

X_no_constant = np.random.randn(n_data, n_dim)
print 'X_no_constant:'
print X_no_constant
print 

# INSERT YOUR CODE HERE!
X = np.hstack([np.ones(n_data)[np.newaxis].T, X_no_constant])
y = np.dot(X, betas)

# Tests:
y_expected = np.array([-0.41518357, -9.34696153, 5.08980544, 
                       -0.26983873, -1.47667864, 1.96580794, 
                       6.87009791, -2.07784135, -0.7726816, 
                       -2.74954984])
np.testing.assert_allclose(y, y_expected)
print '****** Tests passed! ******'


X_no_constant:
[[-0.92232935  0.27352359 -0.86339625  1.43766044 -1.71379871]
 [ 0.179322   -0.89138595  2.13005603  0.51898975 -0.41875106]
 [ 0.34010119 -1.07736609 -1.02314142 -1.02518535  0.40972072]
 [ 1.18883814  1.01044759  0.3108216  -1.17868611 -0.49526331]
 [-1.50248369 -0.196458    0.34752922 -0.79200465 -0.31534705]
 [ 1.73245191 -1.42793626 -0.94376587  0.86823495 -0.95946769]
 [-1.07074604 -0.06555247 -2.17689578  1.58538804  1.81492637]
 [-0.73706088  0.77546031  0.42653908 -0.51853723 -0.53045538]
 [ 1.09620536 -0.69557321  0.03080082  0.25219596 -0.35304303]
 [-0.93971165  0.04448078  0.04273069  0.4961477  -1.7673568 ]]

****** Tests passed! ******

Pandas

Pandas is a Python package which adds some useful data analysis features to numpy arrays. Most importantly, it contains a DataFrame data type, like the R data frame: a set of named columns organized into something like a 2-d array. Pandas is great.


In [19]:
# like with numpy, you don't have to rename pandas to pd, but it's customary to do so
import pandas as pd

b = np.array([[6, 7], [3, 1], [4, 0]])
df = pd.DataFrame(data=b,  columns=['Weight', 'Height'])
print 'b:'
print b
print 
print 'DataFrame version of b:'
print df
print


b:
[[6 7]
 [3 1]
 [4 0]]

DataFrame version of b:
   Weight  Height
0       6       7
1       3       1
2       4       0


In [20]:
# Pandas can save and load CSV files.  
# Python can do this too, but with Pandas, you get a DataFrame 
# at the end which understands things like column headings
baseball = pd.read_csv('data/baseball.dat.txt')

# A DataFrame's .head() method shows its first 5 rows
baseball.head()


Out[20]:
Salary AVG OBP Runs Hits Doubles Triples HR RBI Walks SO SB Errs free agency eligibility free agent in 1991/2 arbitration eligibility arbitration in 1991/2 Name
0 3300 0.272 0.302 69 153 21 4 31 104 22 80 4 3 1 0 0 0 Andre Dawson
1 2600 0.269 0.335 58 111 17 2 18 66 39 69 0 3 1 1 0 0 Steve Buchele
2 2500 0.249 0.337 54 115 15 1 17 73 63 116 6 5 1 0 0 0 Kal Daniels
3 2475 0.260 0.292 59 128 22 7 12 50 23 64 21 21 0 0 1 0 Shawon Dunston
4 2313 0.273 0.346 87 169 28 5 8 58 70 53 3 8 0 0 1 0 Mark Grace
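
The comment above mentions saving as well as loading; going the other direction looks like this (the output filename here is just an example):

# write the DataFrame back out to disk; index=False leaves out the row numbers
baseball.to_csv('data/baseball_copy.csv', index=False)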

In [22]:
# you can see all the column names
print 'baseball.keys():'
print baseball.keys()
print

# print 'baseball.Salary:'
# print baseball.Salary
# print 
# print "baseball['Salary']:"
# print baseball['Salary']


baseball.keys():
Index([u'Salary', u'AVG', u'OBP', u'Runs', u'Hits', u'Doubles', u'Triples',
       u'HR', u'RBI', u'Walks', u'SO', u'SB', u'Errs',
       u'free agency eligibility', u'free agent in 1991/2',
       u'arbitration eligibility', u'arbitration in 1991/2', u'Name'],
      dtype='object')


In [23]:
baseball.info()


<class 'pandas.core.frame.DataFrame'>
RangeIndex: 337 entries, 0 to 336
Data columns (total 18 columns):
Salary                     337 non-null int64
AVG                        337 non-null float64
OBP                        337 non-null float64
Runs                       337 non-null int64
Hits                       337 non-null int64
Doubles                    337 non-null int64
Triples                    337 non-null int64
HR                         337 non-null int64
RBI                        337 non-null int64
Walks                      337 non-null int64
SO                         337 non-null int64
SB                         337 non-null int64
Errs                       337 non-null int64
free agency eligibility    337 non-null int64
free agent in 1991/2       337 non-null int64
arbitration eligibility    337 non-null int64
arbitration in 1991/2      337 non-null int64
Name                       337 non-null object
dtypes: float64(2), int64(15), object(1)
memory usage: 47.5+ KB

In [24]:
baseball.describe()


Out[24]:
Salary AVG OBP Runs Hits Doubles Triples HR RBI Walks SO SB Errs free agency eligibility free agent in 1991/2 arbitration eligibility arbitration in 1991/2
count 337.000000 337.000000 337.000000 337.000000 337.000000 337.000000 337.000000 337.000000 337.000000 337.000000 337.000000 337.000000 337.000000 337.000000 337.000000 337.000000 337.000000
mean 1248.528190 0.257825 0.323973 46.697329 92.833828 16.673591 2.338279 9.097923 44.020772 35.017804 56.706231 8.246291 6.771513 0.397626 0.115727 0.192878 0.029674
std 1240.013309 0.039546 0.047132 29.020166 51.896322 10.452001 2.543336 9.289934 29.559406 24.842474 33.828784 11.664782 5.927490 0.490135 0.320373 0.395145 0.169938
min 109.000000 0.063000 0.063000 0.000000 1.000000 0.000000 0.000000 0.000000 0.000000 0.000000 1.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000
25% 230.000000 0.238000 0.297000 22.000000 51.000000 9.000000 0.000000 2.000000 21.000000 15.000000 31.000000 1.000000 3.000000 0.000000 0.000000 0.000000 0.000000
50% 740.000000 0.260000 0.323000 41.000000 91.000000 15.000000 2.000000 6.000000 39.000000 30.000000 49.000000 4.000000 5.000000 0.000000 0.000000 0.000000 0.000000
75% 2150.000000 0.281000 0.354000 69.000000 136.000000 23.000000 3.000000 15.000000 66.000000 49.000000 78.000000 11.000000 9.000000 1.000000 0.000000 0.000000 0.000000
max 6100.000000 0.457000 0.486000 133.000000 216.000000 49.000000 15.000000 44.000000 133.000000 138.000000 175.000000 76.000000 31.000000 1.000000 1.000000 1.000000 1.000000

In [26]:
#  baseball

In [34]:
# You can perform queries on your data frame.  
# This statement gives you a True/False vector telling you 
# whether the player in each row has a salary over $1 Million
millionaire_indices = baseball['Salary'] > 1000
# print millionaire_indices

In [28]:
# you can use the query indices to look at a subset of your original dataframe
print 'baseball.shape:', baseball.shape
print "baseball[millionaire_indices].shape:", baseball[millionaire_indices].shape


baseball.shape: (337, 18)
baseball[millionaire_indices].shape: (139, 18)

In [33]:
# you can look at a subset of rows and columns at the same time
print "baseball[millionaire_indices][['Salary', 'AVG', 'Runs', 'Name']]:"
baseball[millionaire_indices][['Salary', 'AVG', 'Runs', 'Name']].head()


baseball[millionaire_indices][['Salary', 'AVG', 'Runs', 'Name']]:
Out[33]:
   Salary    AVG  Runs            Name
0    3300  0.272    69    Andre Dawson
1    2600  0.269    58   Steve Buchele
2    2500  0.249    54     Kal Daniels
3    2475  0.260    59  Shawon Dunston
4    2313  0.273    87      Mark Grace
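
An equivalent and arguably cleaner way to do the row-and-column selection above in one step is .loc, which takes a row selector and a column selector together:

baseball.loc[millionaire_indices, ['Salary', 'AVG', 'Runs', 'Name']].head()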

Pandas Joins - If you have time

The real magic of a Pandas DataFrame comes from the merge method, which can match up the rows and columns from two DataFrames and combine their data. Let's load another file which has shoe sizes for just a few players.


In [30]:
# load shoe size data
shoe_size_df = pd.read_csv('data/baseball2.dat.txt')
shoe_size_df


Out[30]:
   Shoe Size          Name
0         11  Andre Dawson
1         13    Mark Grace
2         12    Sammy Sosa

In [31]:
merged = pd.merge(baseball, shoe_size_df, on=['Name'])
merged


Out[31]:
Salary AVG OBP Runs Hits Doubles Triples HR RBI Walks SO SB Errs free agency eligibility free agent in 1991/2 arbitration eligibility arbitration in 1991/2 Name Shoe Size
0 3300 0.272 0.302 69 153 21 4 31 104 22 80 4 3 1 0 0 0 Andre Dawson 11
1 2313 0.273 0.346 87 169 28 5 8 58 70 53 3 8 0 0 1 0 Mark Grace 13
2 200 0.203 0.240 39 64 10 1 10 33 14 96 13 6 0 0 0 0 Sammy Sosa 12

In [32]:
merged_outer = pd.merge(baseball, shoe_size_df, on=['Name'], how='outer')
merged_outer.head()


Out[32]:
Salary AVG OBP Runs Hits Doubles Triples HR RBI Walks SO SB Errs free agency eligibility free agent in 1991/2 arbitration eligibility arbitration in 1991/2 Name Shoe Size
0 3300 0.272 0.302 69 153 21 4 31 104 22 80 4 3 1 0 0 0 Andre Dawson 11.0
1 2600 0.269 0.335 58 111 17 2 18 66 39 69 0 3 1 1 0 0 Steve Buchele NaN
2 2500 0.249 0.337 54 115 15 1 17 73 63 116 6 5 1 0 0 0 Kal Daniels NaN
3 2475 0.260 0.292 59 128 22 7 12 50 23 64 21 21 0 0 1 0 Shawon Dunston NaN
4 2313 0.273 0.346 87 169 28 5 8 58 70 53 3 8 0 0 1 0 Mark Grace 13.0
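
Besides the default inner join and how='outer', merge also supports how='left' and how='right'. For example, a left join keeps every row of baseball and fills in NaN where there is no matching shoe size:

merged_left = pd.merge(baseball, shoe_size_df, on=['Name'], how='left')
print merged_left.shape   # same number of rows as baseball, with one extra column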

Self-Driven Pandas Exercise

  1. Partner up with someone next to you. Then, on one of your computers:

    1. Prepend a column of ones to the dataframe X_df below. Name the new column 'const'.
    2. Again, matrix multiply X_df by the betas vector and assign the result to a new variable: y_new
    3. You'll know you've got it when the cell prints '****** Tests passed! ******' at the bottom.

    Hint: This stackoverflow post may be useful: http://stackoverflow.com/questions/13148429/how-to-change-the-order-of-dataframe-columns


In [36]:
np.random.seed(3333)
n_data = 10 # number of data points. i.e. N
n_dim = 5   # number of dimensions of each datapoint.  i.e. D

betas = np.random.randn(n_dim + 1)

X_df = pd.DataFrame(data=np.random.randn(n_data, n_dim))

# INSERT YOUR CODE HERE!
X_df['const'] = np.ones(n_data)
y_new = np.dot(X_df, betas)

# Tests:
assert 'const' in X_df.keys(), 'The new column must be called "const"'
assert np.all(X_df.shape == (n_data, n_dim+1))
assert len(y_new) == n_data
print '****** Tests passed! ******'


****** Tests passed! ******

In [37]:
X_df


Out[37]:
          0         1         2         3         4  const
0 -0.922329  0.273524 -0.863396  1.437660 -1.713799    1.0
1  0.179322 -0.891386  2.130056  0.518990 -0.418751    1.0
2  0.340101 -1.077366 -1.023141 -1.025185  0.409721    1.0
3  1.188838  1.010448  0.310822 -1.178686 -0.495263    1.0
4 -1.502484 -0.196458  0.347529 -0.792005 -0.315347    1.0
5  1.732452 -1.427936 -0.943766  0.868235 -0.959468    1.0
6 -1.070746 -0.065552 -2.176896  1.585388  1.814926    1.0
7 -0.737061  0.775460  0.426539 -0.518537 -0.530455    1.0
8  1.096205 -0.695573  0.030801  0.252196 -0.353043    1.0
9 -0.939712  0.044481  0.042731  0.496148 -1.767357    1.0
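
The exercise says 'prepend', but the solution above appends 'const' as the last column; the tests only check the shape, not the order. The order does matter if you want betas[0] to line up with the constant column, as in the numpy exercise. One way to move 'const' to the front, in the spirit of the stackoverflow hint:

# build the desired column order explicitly and select the columns in that order
X_df = X_df[['const'] + [c for c in X_df.columns if c != 'const']]
X_df.head()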

In [ ]: